Benchmarking computer platforms for lattice QCD applications

We define a benchmark suite for lattice QCD and report on benchmark results from several computer platforms.
The platforms considered are apeNEXT, CRAY T3E, Hitachi SR8000, IBM p690, PC-Clusters, and QCDOC. 



1. INTRODUCTION 

Simulations of lattice QCD require powerful computers. We have
benchmarked computers that are under consideration by the German Lattice
Forum (LATFOR) [1] to realize its future physics program. The machines we
have tested fall into three categories: (1) machines that are
custom-designed for lattice QCD (apeNEXT and QCDOC), (2) PC-clusters, and
(3) commercial supercomputers (CRAY, Hitachi, IBM).



2. COMPUTER PLATFORMS 

2.1. apeNEXT 

The apeNEXT project [2] was initiated with the goal of building
custom-designed computers with a peak performance of more than 5 TFlops
and a sustained efficiency of about 50% for key lattice gauge theory
kernels. apeNEXT machines should be suitable both for large-scale
simulations with dynamical fermions and for quenched calculations on very
large lattices. The apeNEXT processor is a 64-bit architecture with an
arithmetic unit that can perform the APE normal operation
$a \times b + c$ at every clock cycle, where $a$, $b$, and $c$ are IEEE
128-bit complex numbers. Each apeNEXT processor has a very large register
file of 256 (64+64)-bit registers. On-chip network devices connect the
nodes in a three-dimensional network.
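To make the flop count of the normal operation concrete, the following Python sketch writes $a \times b + c$ out in real arithmetic. The function names are illustrative and not part of any apeNEXT software. The clock frequency computed at the end is only an inference from the per-CPU peak of 1600 MFlops quoted in Table 1, under the assumption of exactly one normal operation per cycle.

```python
# Illustrative sketch; all names are hypothetical, not apeNEXT software.

def normal_op(a: complex, b: complex, c: complex) -> complex:
    """Complex multiply-add a*b + c, written out in real arithmetic."""
    re = a.real * b.real - a.imag * b.imag + c.real   # 2 mul, 2 add/sub
    im = a.real * b.imag + a.imag * b.real + c.imag   # 2 mul, 2 add
    return complex(re, im)

# 4 real multiplications + 4 real additions/subtractions per operation
FLOPS_PER_NORMAL_OP = 8

def implied_clock_mhz(peak_mflops: float) -> float:
    """Clock rate implied by a peak rate, assuming one normal op per cycle."""
    return peak_mflops / FLOPS_PER_NORMAL_OP

result = normal_op(1 + 2j, 3 - 1j, 0.5j)
clock = implied_clock_mhz(1600.0)
print(result, clock)
```

Counting 8 real floating point operations per normal operation is the usual convention for a complex multiply-add.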



2.2. QCDOC 

QCDOC ("QCD on a Chip") [3] is a massively parallel computer optimized
for lattice QCD, developed by a collaboration of Columbia University,
UKQCD, the RIKEN-BNL Research Center, and IBM. Individual nodes are based
on an application-specific integrated circuit (ASIC) which combines IBM's
system-on-a-chip technology (including a PowerPC 440 CPU core, a 64-bit
FPU, and 4 MB on-chip memory) with custom-designed communications
hardware. The nodes communicate via nearest-neighbor connections in six
dimensions. The low network latency and built-in hardware assistance for
global sums enable QCDOC to concentrate computing power in the TFlops
range on a single QCD problem.



2.3. PC-cluster 

In recent years, commodity-off-the-shelf (COTS) Linux cluster computers
have become cost-efficient, general-purpose, high-performance computing
devices. QCD simulations on cluster computers can be accelerated
considerably by exploiting the SIMD and data-prefetch functionality of
Intel Pentium processors through SSE/SSE2 instructions in assembler code.
The benchmarked PC-cluster has 1.7 GHz Xeon Pentium 4 CPUs with 1 GB of
Rambus memory. The nodes communicate via a Myrinet2000 interconnect.

2.4. CRAY T3E-900

The CRAY T3E is a classic massively parallel computer. It has single-CPU
nodes and a three-dimensional torus network. The T3E architecture is
rather well balanced. Therefore, the overall performance of parallel
applications scales to much higher numbers of CPUs than on machines that
were built later. The peak performance of a T3E-900 is 900 MFlops per
CPU, the network latency is 1 μs, and the bidirectional network bandwidth
is 350 MByte/s.

2.5. Hitachi SR8000-F1 

The Hitachi SR8000 is a parallel computer with shared memory nodes. Each
node has 8 CPUs. The key features of the CPUs are the high memory
bandwidth and the availability of 160 floating point registers. These
features are accompanied by pseudo-vectorization, an intelligent prefetch
mechanism that allows computation to overlap with fetching data from
memory. Pseudo-vectorization is done by the compiler. The peak
performance of an SR8000 CPU is 1500 MFlops, the network latency is
19 μs, and the bidirectional bandwidth between nodes is 950 MByte/s.

2.6. IBM p690-Turbo

The IBM p690 is a cluster of shared memory nodes. Its CPUs (and nodes)
have the highest peak performance of the machines considered but only a
relatively slow network. To increase the bandwidth of the interconnect,
the 32-CPU nodes are commonly divided into 8-CPU nodes, which increases
the bandwidth per CPU by a factor of 4. The performance depends to a
large extent on the configuration of the machine. When benchmarking this
architecture, one also has to take into account that the performance
drops by a factor of 3-5 when all CPUs are used instead of only one. The
peak performance of a Power4 CPU is 5200 MFlops, the network latency (of
the so-called colony network) is 20 μs, and the bidirectional bandwidth
between nodes is 450 MByte/s.
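The network balance of the three commercial machines can be compared with a rough figure of merit: inter-node bandwidth divided by the peak rate of one node. The sketch below is an illustration only. It uses the bandwidths quoted above and the per-CPU peak values of Table 1, and it assumes that the quoted bandwidth is shared equally by all CPUs of a node and that the p690 figure refers to 8-CPU nodes. Neither assumption is stated in the text, and the function names are hypothetical.

```python
# Rough network-balance estimate; the sharing model is an assumption.

def bytes_per_flop(bandwidth_mbyte_s: float,
                   peak_mflops_per_cpu: float,
                   cpus_per_node: int) -> float:
    """Inter-node bandwidth per unit of peak node performance."""
    return bandwidth_mbyte_s / (peak_mflops_per_cpu * cpus_per_node)

def bandwidth_gain(cpus_before: int, cpus_after: int) -> float:
    """Gain in bandwidth per CPU when a node is split into smaller nodes,
    assuming each node keeps the same network bandwidth."""
    return cpus_before / cpus_after

# (bandwidth in MByte/s, peak MFlops per CPU, CPUs per node)
machines = {
    "CRAY T3E-900": (350.0, 900.0, 1),
    "Hitachi SR8000-F1": (950.0, 1500.0, 8),
    "IBM p690-Turbo": (450.0, 5200.0, 8),   # 8-CPU nodes assumed
}

balance = {name: bytes_per_flop(*spec) for name, spec in machines.items()}
for name, value in balance.items():
    print(f"{name:18s} {value:.4f} byte/flop")

print(bandwidth_gain(32, 8))
```

Under these assumptions the T3E has by far the largest ratio, which is one way to read the statement that its architecture is well balanced.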

3. BENCHMARK SUITE 

In this contribution we concentrate on one particular application:
large-scale simulations of dynamical Wilson fermions with
O(a)-improvement on lattices of size $V = 32^3 \times 64$. We assume that
these simulations are performed using the Hybrid Monte Carlo algorithm
[3] or the Polynomial Hybrid Monte Carlo algorithm [5], as was done in
simulations with dynamical fermions in recent years [6].

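The most time-consuming operation is the fermion matrix multiplication.
We denote the fermion matrix by $M[U] = T[U] - H[U]$, where $H$ is the
Wilson hopping term

$$ H[U]_{xy} = \kappa \sum_{\mu} \left\{ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,y} + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \delta_{x-\hat\mu,y} \right\} \tag{1} $$

and $T$ is the clover term
$T[U] = 1 - \frac{i}{2} \kappa c_{sw} F_{\mu\nu} \sigma_{\mu\nu}$.
Here we only consider the even-odd preconditioned version
$\psi = H_{eo} \phi$.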
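Even-odd preconditioning rests on the fact that the hopping term only couples nearest neighbors, which always have opposite parity, so that $H_{eo}$ maps a field on odd sites to a field on even sites. The following sketch checks this property on a small lattice. The lattice size, the periodic boundary conditions, and all function names are assumptions made for illustration.

```python
from itertools import product

# Small stand-in for the 32^3 x 64 lattice; all extents must be even.
dims = (4, 4, 4, 8)

def parity(site):
    """0 for even sites, 1 for odd sites."""
    return sum(site) % 2

def neighbors(site, dims):
    """Nearest neighbors in the +mu and -mu directions (periodic lattice)."""
    for mu in range(len(dims)):
        for step in (+1, -1):
            shifted = list(site)
            shifted[mu] = (shifted[mu] + step) % dims[mu]
            yield tuple(shifted)

sites = list(product(*(range(d) for d in dims)))
even = [s for s in sites if parity(s) == 0]
odd = [s for s in sites if parity(s) == 1]

# Every hop changes the parity, so the hopping term has no
# even-even or odd-odd blocks.
hops_flip_parity = all(
    parity(n) != parity(s) for s in sites for n in neighbors(s, dims)
)
print(len(even), len(odd), hops_flip_parity)
```

With even lattice extents the two sublattices have the same size, so an even-odd preconditioned operator acts on half of the lattice sites.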

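Basic linear algebra operations are needed in iterative solvers. We have
considered the scalar product

$$ (\psi, \phi) = \sum_{x=1}^{V} \sum_{i=1}^{3} \sum_{\alpha=1}^{4} \psi^*_{i,\alpha}(x)\, \phi_{i,\alpha}(x), \tag{2} $$

the vector norm

$$ \|\psi\|^2 = \sum_{x=1}^{V} \sum_{i=1}^{3} \sum_{\alpha=1}^{4} |\psi_{i,\alpha}(x)|^2, \tag{3} $$

the zaxpy operation

$$ \psi_{i,\alpha}(x) \leftarrow \psi_{i,\alpha}(x) + c\, \phi_{i,\alpha}(x), \quad c \in \mathbb{C}, \tag{4} $$

and the daxpy operation

$$ \psi_{i,\alpha}(x) \leftarrow \psi_{i,\alpha}(x) + r\, \phi_{i,\alpha}(x), \quad r \in \mathbb{R}. \tag{5} $$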
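A minimal Python sketch of operations (2) to (5) follows. It assumes that a spinor field is stored as a flat list of 12V complex numbers (3 colors times 4 spin components per site). This layout and the function names are illustrative and say nothing about the benchmarked Fortran, C, or assembler implementations.

```python
# Illustrative only: a spinor field is a flat list of 12*V complex numbers.

def scalar_product(psi, phi):
    """Eq. (2): sum over all components of conj(psi) * phi."""
    return sum(p.conjugate() * q for p, q in zip(psi, phi))

def norm_squared(psi):
    """Eq. (3): sum over all components of |psi|^2 (a real number)."""
    return sum(p.real * p.real + p.imag * p.imag for p in psi)

def zaxpy(c, phi, psi):
    """Eq. (4): psi <- psi + c * phi with complex c, updated in place."""
    for k in range(len(psi)):
        psi[k] += c * phi[k]

def daxpy(r, phi, psi):
    """Eq. (5): psi <- psi + r * phi with real r, updated in place."""
    for k in range(len(psi)):
        psi[k] = complex(psi[k].real + r * phi[k].real,
                         psi[k].imag + r * phi[k].imag)

# Tiny two-component example in place of a full lattice field.
psi = [1 + 1j, 2j]
phi = [1j, 1 + 0j]
sp = scalar_product(psi, phi)
n2 = norm_squared(psi)
zaxpy(2j, phi, psi)
after_zaxpy = list(psi)
daxpy(0.5, phi, psi)
print(sp, n2, after_zaxpy, psi)
```

The real-coefficient update needs half the arithmetic of the complex one for the same memory traffic, which is consistent with daxpy showing the lowest rates of the four operations in Table 1.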

Two basic operations involving link variables are part of our benchmark:
the multiplication of an SU(3) matrix by a vector,

$$ \psi = U * \phi; \qquad \psi_i = \sum_{j=1}^{3} U_{ij}\, \phi_j, \tag{6} $$

and the multiplication of two SU(3) matrices,

$$ W = U * V; \qquad W_{ij} = \sum_{k=1}^{3} U_{ik}\, V_{kj}. \tag{7} $$
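Operations (6) and (7) can be sketched in a few lines of Python, with matrices stored as nested lists of complex numbers. The function names, the example matrices, and the flop counts in the comments (the usual convention of 6 flops per complex multiplication and 2 per complex addition) are illustrative assumptions rather than details taken from the benchmark codes.

```python
# Illustrative sketch of Eqs. (6) and (7) with 3x3 complex matrices.

def su3_times_vector(U, phi):
    """Eq. (6): psi_i = sum_j U_ij phi_j."""
    return [sum(U[i][j] * phi[j] for j in range(3)) for i in range(3)]

def su3_times_su3(U, V):
    """Eq. (7): W_ij = sum_k U_ik V_kj."""
    return [[sum(U[i][k] * V[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# 9 complex multiplications and 6 complex additions per matrix-vector
# product; a matrix-matrix product is three such products.
FLOPS_MATVEC = 9 * 6 + 6 * 2
FLOPS_MATMAT = 3 * FLOPS_MATVEC

# Two simple SU(3) elements: a diagonal phase matrix and a cyclic
# permutation, both unitary with determinant 1.
U = [[1j, 0, 0], [0, -1j, 0], [0, 0, 1]]
V = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

psi = su3_times_vector(U, [1, 2, 3])
W = su3_times_su3(U, V)
print(psi, W, FLOPS_MATVEC, FLOPS_MATMAT)
```

Both kernels perform many operations per matrix element loaded, which fits their comparatively high rates on most platforms in Table 1.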

Finally, the benchmark contains the basic operations involving the clover
term, $\psi = T\phi$ and $\psi = T^{-1}\phi$. These were implemented with
6x6 block matrices,

$$ \psi = \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \phi. \tag{8} $$

Table 1

Benchmark results in MFlops per CPU. All numbers refer to 64-bit floating
point arithmetic. Italic numbers indicate that communications overhead
has been included. Further details are given in the text.

                  apeNEXT   QCDOC  PC-Cluster   CRAY  Hitachi    IBM
Peak [MFlops]        1600    1000        3400    900     1500   5200
$H_{eo}\phi$          894     535         930    101      632    299
$(\psi,\phi)$         656     450         530    148      680    303
$\|\psi\|^2$          592     384         510     98      789    187
zaxpy                 464     450         358    114      479    234
daxpy                 116     190         183     57      241    115
$U*\phi$             1264     780         307    104      811    261
$U*V$                1040     800         763    118     1182    413
$T\phi$              1136     790         800    111     1137    608


4. BENCHMARK RESULTS 

Our benchmark results are listed in Table 1. The values for apeNEXT and
QCDOC were obtained from cycle-accurate simulations of the forthcoming
hardware. All the other performance numbers were measured on existing
machines.
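The entries of Table 1 are easier to compare when expressed as a fraction of peak performance. The short sketch below does this for the hopping term, using only the "Peak" row and the hopping-term row of Table 1. The dictionary and function names are illustrative.

```python
# Sustained efficiency of the hopping term, from Table 1 (MFlops per CPU).

PEAK = {"apeNEXT": 1600, "QCDOC": 1000, "PC-Cluster": 3400,
        "CRAY": 900, "Hitachi": 1500, "IBM": 5200}
HOPPING = {"apeNEXT": 894, "QCDOC": 535, "PC-Cluster": 930,
           "CRAY": 101, "Hitachi": 632, "IBM": 299}

def efficiency_percent(sustained_mflops: float, peak_mflops: float) -> float:
    """Sustained performance as a percentage of peak."""
    return 100.0 * sustained_mflops / peak_mflops

efficiency = {name: efficiency_percent(HOPPING[name], PEAK[name])
              for name in PEAK}
for name, value in efficiency.items():
    print(f"{name:10s} {value:5.1f}% of peak")
```

For apeNEXT this gives about 56% of peak for the hopping term, in line with the design target of roughly 50% sustained efficiency mentioned in Section 2.1.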

On apeNEXT and QCDOC the hopping term was benchmarked by distributing the
problem over the maximum number of nodes for the given problem size. For
the PC-cluster, where a C code with SSE/SSE2 instructions based on the
benchmark program of M. Lüscher [7] has been used, only single-node
numbers (for $V = 16^4$) are quoted because there is still some debate
over which network to use (e.g., Myrinet, Infiniband, Gbit-Ethernet). The
commercial machines have been benchmarked using 256, 64, and 64 CPUs on
CRAY, Hitachi, and IBM, respectively. We used the Fortran90 production
code of the QCDSF collaboration, which is parallelized with MPI and
OpenMP. For the linear algebra routines we used Fortran loops on the
Hitachi and the vendors' high-performance libraries on CRAY and IBM.

In the case of the scalar product and the vector norm we quote only the
single-processor performance, since the performance including the global
sum depends on the number of nodes. We estimated the overhead for
computing the global sum on some platforms, since it will affect the
scalability of the considered application when going to a very large
number of nodes:

apeNEXT      5.2 μs        on 4 x 8 x 8 = 256 CPUs
QCDOC        10 (15) μs    on 4096 (32768) CPUs
PC-cluster   138 (166) μs  on 256 (1024) CPUs
5. CONCLUSIONS 

We presented a selection of benchmarks relevant for large-scale
simulations of QCD with dynamical fermions and provided initial benchmark
results for a range of platforms. A more detailed comparison of these
platforms in terms of price/performance ratio, hardware reliability,
software support, etc. is beyond the scope of this contribution. These
questions will be addressed in a future publication.